Skip to content

feat: Impl IncrementalAppendScan#590

Merged
wgtmac merged 4 commits into
apache:mainfrom
WZhuo:incremental_append_scan
Mar 27, 2026
Merged

feat: Impl IncrementalAppendScan#590
wgtmac merged 4 commits into
apache:mainfrom
WZhuo:incremental_append_scan

Conversation

@WZhuo

@WZhuo WZhuo commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

No description provided.

@WZhuo WZhuo force-pushed the incremental_append_scan branch 3 times, most recently from 9b822d0 to 4c4ffff Compare March 13, 2026 03:45
Comment thread src/iceberg/table_scan.h

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This review was generated by Gemini.

Comment thread src/iceberg/table_scan.cc Outdated
Comment thread src/iceberg/table_scan.h Outdated
Comment thread src/iceberg/table_scan.h Outdated
@WZhuo WZhuo force-pushed the incremental_append_scan branch 2 times, most recently from d1bacdf to 5c41a78 Compare March 23, 2026 03:30
@WZhuo WZhuo force-pushed the incremental_append_scan branch 2 times, most recently from 3f97b02 to 8e2ad99 Compare March 24, 2026 06:10

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This review was generated by Gemini.

Summary & Recommendation

  • Request Changes: There are logic bugs in handling empty table snapshots and a potential hashing/equality discrepancy for manifest files that need to be addressed.

Comment thread src/iceberg/table_scan.cc
Comment thread src/iceberg/manifest/manifest_list.h
@wgtmac

wgtmac commented Mar 25, 2026

Copy link
Copy Markdown
Member

Mostly look good. I will merge this after two remaining issues have been addressed.

@WZhuo WZhuo force-pushed the incremental_append_scan branch from 8e2ad99 to f3cd338 Compare March 26, 2026 10:03
@wgtmac wgtmac merged commit cdf05d6 into apache:main Mar 27, 2026
12 checks passed
kevinjqliu pushed a commit to apache/iceberg-python that referenced this pull request Jun 28, 2026
<!-- Closes #2634 -->

Closes #2634.

# Rationale for this change

Adds `IncrementalAppendScan`, which reads the data appended between two
snapshots — the building block for incremental ingestion. Largely a
revival of the work in #2235; see #2634 and the previous PRs for
motivation.

Split out of #3364 at the reviewers' request, and builds on the
now-merged `BaseScan` / `ManifestGroupPlanner` refactor (#3511), so this
PR's diff is the append-scan feature alone.

The surface mirrors Iceberg-Java's engine-facing API (snapshot IDs,
inclusive/exclusive start, optional start) rather than the narrower
Spark read options, since PyIceberg is increasingly used by engines
(e.g. Polars). See the API discussion on this PR.

References: https://github.com/apache/iceberg (Iceberg-Java and Spark)
and apache/iceberg-cpp#590. Inline review-aid
comments (prefixed `[AI reviewer aid]`) point at the relevant reference
code.

# API

`Table.incremental_append_scan(...)` returns an `IncrementalAppendScan`;
`StagedTable` overrides it to raise, mirroring `scan()`. The scan reads
the rows added by **append** snapshots in `(from, to]`, projected onto
the table's current schema; delete / overwrite / replace snapshots in
the range (e.g. compaction) are ignored.

The range is set via the factory's Spark-style kwargs or the builder
methods, each of which returns a refined copy (like `select()` /
`filter()`):

```python
table.incremental_append_scan(
    from_snapshot_id_exclusive=None,   # optional; defaults to the oldest ancestor of `to`
    to_snapshot_id_inclusive=None,     # optional; defaults to the current snapshot
    row_filter=..., selected_fields=..., case_sensitive=..., options=..., limit=...,
)

scan.from_snapshot_id_exclusive(id)    # or .from_snapshot_id_inclusive(id)
    .to_snapshot_id_inclusive(id)
```

The range is held as public attributes — `from_snapshot_id` +
`from_snapshot_inclusive` + `to_snapshot_id` — a single start slot plus
an inclusive flag, mirroring Java's `TableScanContext` and consistent
with the other scans.

# Changes

- Range resolution mirrors Java's `BaseIncrementalScan`: an unset start
scans from the oldest ancestor of the end; an inclusive start resolves
to its parent as the exclusive boundary; an exclusive start is validated
with `is_parent_ancestor_of`, so an expired start cursor is accepted as
long as the lineage still passes through it; the end defaults to the
current snapshot; an empty table with no range set scans nothing.
- Planning walks the append-only ancestors in the range, dedups the data
manifests whose `added_snapshot_id` is in range (set semantics via
`ManifestFile.__eq__` / `__hash__`), and filters manifest entries to
`ADDED`-in-range via a new `manifest_entry_filter` on
`ManifestGroupPlanner.plan_files`. Compacted (`rewrite_data_files`)
output is therefore not picked up — no double counting.
- Projects onto the table's **current** schema (matching Java/C++), so
rows written under an older schema in the range get `NULL` for newer
columns.
- Adds snapshot helpers `ancestors_between_ids`, `is_ancestor_of`, and
`is_parent_ancestor_of`.
- Arrow materialization (`to_arrow` / `to_arrow_batch_reader`) is shared
with `DataScan` via small module-level helpers that take the projected
schema explicitly, so `BaseScan` stays projection-free (per the #3511
review).

# Out of scope (tracked follow-ups)

- Branch selection (`use_branch`) and per-endpoint ref/tag start & end
(`from_ref_*` / `to_ref_*`) — the rest of the engine-facing surface Java
exposes.
- `count()`, REST server-side planning, and user-facing doc examples
(`mkdocs`).
- `dictionary_columns` on `IncrementalAppendScan.to_arrow` /
`to_arrow_batch_reader` (added to `DataScan` in #3461; the shared
helpers already thread it) — kept out to isolate this PR.

# Are these changes tested?

Yes — unit tests (range resolution including unset / inclusive /
exclusive and expired start, current-schema projection, builder and
`update()` copies, empty table, staged-table guard) and integration
tests (append-only, non-append snapshots ignored, compaction not
double-counted, schema evolution within range, partition- and
metrics-evaluator pruning, disconnected snapshots), plus the
`test_incremental_read` provision fixture.

# Are there any user-facing changes?

Yes — the new `Table.incremental_append_scan(...)` API and
`IncrementalAppendScan` class. No changes to existing public surface.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@WZhuo WZhuo deleted the incremental_append_scan branch July 2, 2026 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants